Path: ...!eternal-september.org!feeder.eternal-september.org!mx02.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Interpreter loops vs branch predictors
Date: Mon, 07 Sep 2015 12:25:07 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 187
Message-ID: <2015Sep7.142507@mips.complang.tuwien.ac.at>
References: <2015Aug10.194919@mips.complang.tuwien.ac.at> <2015Aug11.090812@mips.complang.tuwien.ac.at> <4f9355a3-4f22-4428-9ef7-c5fb3e54a1fd@googlegroups.com>
Injection-Info: mx02.eternal-september.org; posting-host="d47d3421039fe8026514328ad0ebacae"; logging-data="30576"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19B6PxvVzfuGwTkieGCHRSd"
X-newsreader: xrn 10.00-beta-3
Cancel-Lock: sha1://EMmKcrcKugh4t9QqfnfOBaWTY=
Bytes: 9775

Bruce Hoult writes:
>Do you not have a Haswell machine available? I'm sure someone will have if
>you have test binaries available.

I now have access to a Core i7-4790K (Haswell); here are the results
for the same things I have done before on the Ivy Bridge; Turbo Boost
was disabled on the Haswell (i.e., it ran at 4GHz).

Run time for individual benchmarks in seconds of user time; branch
misses reported for running them all:
for j in ' --ss-number=0 --ss-states=0 --no-dynamic' \
         ' --ss-number=0 --ss-states=0 --no-super' \
         ' --ss-number=0 --ss-states=0'; do
  for k in gforth-fast-4.8-PR15242 gforth-fast-4.8; do
    i=$k$j
    echo $i onebench.fs
    perf stat -x " " -e branch-misses $i onebench.fs
  done
done

gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
sieve bubble matrix   fib   fft
0.264  0.196  0.132 0.228 0.088
13039125 branch-misses

gforth-fast-4.8 --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
sieve bubble matrix   fib   fft
0.112  0.152  0.092 0.164 0.060
11273919 branch-misses

gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 --no-super onebench.fs
sieve bubble matrix   fib   fft
0.120  0.160  0.088 0.156 0.052
11457651 branch-misses

gforth-fast-4.8 --ss-number=0 --ss-states=0 --no-super onebench.fs
sieve bubble matrix   fib   fft
0.112  0.152  0.088 0.152 0.056
11107473 branch-misses

gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 onebench.fs
sieve bubble matrix   fib   fft
0.096  0.128  0.052 0.100 0.032
8690210 branch-misses

gforth-fast-4.8 --ss-number=0 --ss-states=0 onebench.fs
sieve bubble matrix   fib   fft
0.088  0.120  0.052 0.088 0.036
7687416 branch-misses

gforth-fast-4.8-PR15242 is built with gcc-4.8, with one shared indirect
jump (that is jumped-to with direct jumps, similar to what switch-based
interpreters usually generate), while gforth-fast-4.8 has a separate
indirect jump at the end of each VM instruction implementation;
"--no-dynamic" (the first two runs) uses that code directly.
"--no-super" uses dynamic replication (and gives each replicated
implementation its own indirect branch); and the last two runs use
dynamic superinstructions with replication (also with a separate
indirect branch at the end of each dynamic superinstruction).  See
<2015Aug10.194919@mips.complang.tuwien.ac.at> for a more detailed
description.
The branch mispredictions are indeed not that much different between
the versions (all better than the best result on the Ivy Bridge), but
there is still quite a bit of performance difference between the first
two (just from the extra direct jump); dynamic replication does not
give better branch prediction on the Haswell, and consequently no
better performance than gforth-fast-4.8 --no-dynamic.  But
superinstructions with replication do provide a nice speedup again,
probably due to fewer executed branches.

So, if you conclude from the paper that optimizing the dispatch of
interpreters is no longer relevant, that's not true even on Haswell
(and certainly not on any other CPU).  On most benchmarks the last
version is more than twice as fast as the first one, and the first one
is probably faster than a switch-based interpreter.

But the paper is correct that the branch prediction of the Haswell is
much better than that of earlier CPUs.  The difference is most extreme
for gforth-fast-4.8-PR15242 --no-dynamic: 71M mispredictions on Ivy
Bridge, 13M on Haswell (Ivy Bridge results:
<2015Aug11.090812@mips.complang.tuwien.ac.at>).

Let's look at more substantial benchmarks, from the Forth appbench
suite (Ivy Bridge results in the same posting
<2015Aug11.090812@mips.complang.tuwien.ac.at>).

Run time in seconds:

 shared   non-shared
 --no-    --no-    --no-
 dynamic  dynamic  super   default
  3.332    2.440   2.276    1.468  benchgc
  1.652    1.524   1.064    0.544  brainless
  4.016    3.416   2.288    1.520  brew
  3.420    3.236   2.232    1.232  cd16sim
  2.956    2.684   1.484    0.864  fcp
 13.128    9.848   9.280    7.840  lexex

Again, speedup factors 1.67-3.42 from interpreter optimizations from
the most switch-like version to the Gforth default, and also good
speedups for separate indirect jumps over the shared indirect jump.
For mispredictions the picture was as follows:

      shared       non-shared
      --no-        --no-        --no-
      dynamic      dynamic      super       default
  24,596,739   17,005,970   12,584,698    9,426,835  benchgc
 163,953,064  178,234,917   51,375,412   18,391,920  brainless
 281,856,378  279,851,195   47,928,843   21,390,209  brew
 268,684,916  297,000,355   70,879,605   22,105,620  cd16sim
 255,802,864  293,473,631   33,496,775   17,280,944  fcp
  39,426,063   12,267,065    8,269,104    6,885,712  lexex

We don't see a consistent benefit from the separate indirect jumps
here; replication does help a lot for a number of benchmarks, though,
and dynamic superinstructions with replication even more.  These
differences in branch prediction can explain the differences in
run time between the second and the third column.  So, even with
Haswell's branch predictor, there are significant branch prediction
benefits from replication, resulting in significant speedups.  I'll
leave the explanation of these results to the branch predictor
experts.

The difference from Ivy Bridge
<2015Aug11.090812@mips.complang.tuwien.ac.at> is quite big, especially
for the first lexex column: 2590M mispredictions on Ivy Bridge, 39M on
Haswell.

Finally, the results from my interpreter dispatch microbenchmark
(cf. <2015Aug11.160837@mips.complang.tuwien.ac.at> for Ivy Bridge
results):

subrout direct indirect switch call repl-sw
   0.00   0.02     0.02   0.14 0.15    0.11

Branch mispredictions:

for i in subroutine direct indirect switch call repl-switch; do
  perf stat -x " " -e branch-misses ./$i |& awk '{printf("%10d '$i'\n",$1)}'
done

      3835 subroutine
      3302 direct
      3863 indirect
      9083 switch
      4419 call
      6018 repl-switch

Close to perfect branch prediction for all variants (there are 100M
indirect branches/calls in each benchmark except subroutine).  There is
still a speed difference, probably due to the additional overhead of
switch, call, and repl-switch: direct and indirect use 1 cycle/VM
instruction, switch 5.8, call 6.1, and repl-switch 4.4.
It's interesting that, even without the branch prediction advantage
that was the motivation for repl-switch, repl-switch is still quite a
bit faster than switch (at least for this benchmark; for bigger
programs the larger cache load from the replicated switch tables could
become a problem).

My conclusion: Haswell's indirect branch prediction is really much
better than before, but for larger programs running on an interpreter
like Gforth, replication still provides substantial branch prediction
improvements that result in significant speedups.  And while there is
no longer a branch prediction advantage to keeping separate indirect
branches for dispatch, there is still a significant speed advantage;
and dynamic superinstructions also provide a good speedup, resulting
in overall speedups by factors 1.67-3.42 for the application
benchmarks.

Why are the results here different from those in the paper?
1) A different interpreter; 2) different benchmarks.

If you write an interpreter and look for performance, should you go
for interpreter optimizations like threaded code, replication, and
dynamic superinstructions, as I suggest, or just use a switch-based
interpreter, as the paper suggests?  Threaded code is a good idea with
little cost in any case.  If that provides a significant speedup and
your VM instruction implementations are short (you run into cache
trouble with long VM instructions [vitale&abdelrahman04]), then
replication with superinstructions will probably give a good speedup.

@InProceedings{vitale&abdelrahman04,
  author    = {Benjamin Vitale and Tarek S. Abdelrahman},
  title     = {Catenation and Specialization for {Tcl} Virtual Machine
               Performance},
  crossref  = {ivme04},
  pages     = {42--50},
  year      = {2004},
  OPTannote = {}
}

@Proceedings{ivme04,
  booktitle = {IVME '04 Proceedings},
  title     = {IVME '04 Proceedings},
  year      = {2004},
  OPTeditor = {}
}

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed

========== REMAINDER OF ARTICLE TRUNCATED ==========